FIERCE (Framework for InfERence of veloCity of the Entropy) is a computational pipeline that constructs visual representations (“streamplots”) of both the topological structure and the inferred future dynamics of the differentiation potency landscape of cell populations from scRNA-seq data. It has been designed according to a “bottom-up” strategy that is based on the following rationale: if we assume that the differentiation potency of cells is efficiently approximated by their transcriptional entropy, it is possible to reconstruct dynamic trajectories directly on the potency landscape of cell populations through (i) the computation of the rate of change of the transcriptional entropy of cells through time and (ii) the prediction of the future movements of cells on the entropy space based on such rate of change.
We implemented this concept into a three-steps process:
In the first step, the future spliced transcript counts of all genes in all cells are predicted by summing the observed spliced counts to the corresponding RNA velocities (defined as the rate of change of spliced counts over time; La Manno et al, Nature 2018).
In the second step, the transcriptional entropies of cells are computed as their respective signaling entropy scores (Teschendorff et al, Nat Comm 2017), and the future dynamics of such scores are inferred. Specifically, the local signaling entropy scores of all genes in all cells are computed from both the observed and the future spliced counts, on the basis of their distribution over the edges of a cell-shared genome wide protein-protein interaction (PPI) network. The observed local signaling entropy scores are then subtracted from the corresponding future scores to obtain the “velocity of the entropy” of each gene in each cell, i.e., the rate of change of its local signaling entropy over time.
In the third and final step, a UMAP embedding is constructed directly from the observed local signaling entropy scores, to obtain a structural representation of the differentiation potency landscape of the system; finally, the vector field of the velocity of the entropy is constructed on the resulting embedding, under the assumption that, as the dynamic process unfolds, each cell is expected to move towards the neighbors whose signaling entropy is most correlated with its velocity of the entropy.
FIERCE is a hybrid computational pipeline that has been developed to operate in the R environment, but performs all computations on an object of type anndata (Virshup et al, bioRxiv 2021), that is commonly used in the python environment to handle large annotated data matrices. This solution was adopted to integrate the signaling entropy computation, that is performed in the R environment by the SCENT package (Teschendorff et al, Nat Comm 2017), with RNA velocity computation, that is performed by the scVelo python package (Bergen et al, Nat Biotech 2020) on an anndata object. For this reason, FIERCE occasionally interfaces with python through the reticulate package, mainly for performing velocity-related tasks and for handling the anndata object.
An object of type anndata typically contains a main data matrix and several additional layers that have the same dimensions of the main matrix but can contain different types of data. In the context of the computational analysis of scRNA-seq data, the main matrix typically contains the total transcript counts of each gene in each cell, while the additional layers are often used to store specific types of counts, e.g., spliced and unspliced counts, that are fundamental for RNA velocity computation. To be processed by FIERCE, an anndata object must necessarily contain both the spliced and the unspliced counts layers.
Any anndata object can be analyzed with FIERCE, as long as it contains the necessary spliced and unspliced layers, and it is saved on disk as an h5ad file. It can be directly loaded into the R workspace through the dedicated function:
adata <- load_h5ad("adata.h5ad")
If a pre-made anndata object satisfying the prerequisite specified above is not already available to the user, it can be created de novo through the dedicated build_adata_object function (see Creating a new anndata object from scratch).
Once a suitable anndata object has been loaded into the R workspace, it will be analyzed by FIERCE according to the three-steps pipeline described in The FIERCE analysis workflow. The results of the pipeline can be saved at any time by writing the anndata object to an h5ad file through the dedicated function:
save_h5ad(adata, "adata.h5ad")
In The FIERCE analysis workflow section of this tutorial we will be using the test dataset included in the FIERCE package. It contains a random subset of 500 cells of the scRNA-seq dataset sequenced by Bastidas-Ponce et al, Development 2019 to study the differentiation process of endocrine cells from their ductal progenitors in the mouse pancreas. It can be loaded into the R workspace through the following command:
adata <- load_test_dataset("pancreas")
If a pre-made anndata object is not already available to the user, it can be created de novo through the dedicated build_adata_object function, that generates a suitable anndata object from pre-existing loom files containing the spliced and unspliced transcript counts matrices of all samples. Such loom files must be necessarily generated beforehand with one of the dedicated functions of velocyto (La Manno et al, Nature 2018). We will explain how the build_adata_object function works with a hypothetical example. Let’s say a scRNA-seq dataset includes two samples, whose spliced and unspliced counts matrices are saved in two separate loom files, namely “sample_1.loom” and “sample_2.loom”; then a new anndata object containing both samples merged together can be created as follows:
adata <- build_adata_object(c("sample_1.loom", "sample_2.loom"))
The build_adata_object function can also be used to add pre-computed cell annotations and embedding coordinates to the newly generated anndata object, where they will be stored into dedicated slots. Such data can be either provided as separate data-frames, or directly retrieved from a Seurat object. The latter is particularly convenient in case the user has already analyzed the data with the Seurat R package (Satija et al, Nat Biotech 2015), and wishes to start the FIERCE analysis with the computed principal components, or to color the cells according to the computed clusters. Following the example above, let’s say we want to import the principal components and the UMAP coordinates that we have previously computed during a standard Seurat analysis, as well as the sample and cluster identities stored in the metadata slot of the Seurat object:
adata <- build_adata_object(loom_file=c("sample_1.loom", "sample_2.loom"),
Seurat_object=name_of_Seurat_object,
replace_matrix=TRUE,
matrix_slot="counts",
Seurat_cellnames_prefix=c("Sample1_", "Sample2_"),
Seurat_cellnames_suffix=c("-1"),
sample_column="Sample",
metadata_keys=c("Sample", "Clusters"),
reduction_keys=c("pca", "umap"))
In the previous command, replace_matrix is set to TRUE to retrieve the gene expression matrix from the Seurat object (specifically, from the “counts” slot of the active assay), and to add it to the new anndata object as the main data matrix; note that by default a copy of the spliced counts matrix retrieved from the loom files is set as the main data matrix instead. The parameters Seurat_cellnames_prefix and Seurat_cellnames_suffix are used to specify the lists of unique strings that, respectively, precede and follow the barcodes in each cell name in the Seurat object. They are necessary to interface the Seurat object with the anndata object, that uses a different nomenclature. In our example, let’s say that the cell names in the Seurat object contain the sample name, followed by an underscore, followed by the cell barcode, followed by the suffix “-1”. Examples of four cells coming from different samples may then be:
Sample1_ACGTACTGGG-1
Sample1_AAATGCCGAT-1
Sample2_ATGGGCACGT-1
Sample2_GCATCCGATC-1
In this case, we have two unique strings that precede the barcodes (“Sample1_” and “Sample2_”) and just one unique string that follows the barcodes (“-1”). The parameter sample_column is used to specify the name of the column of the metadata slot in the Seurat object that contains the sample identity of each cell; note that the sample names in such column must necessarily correspond to the names of the respective loom files, excluding the extension “.loom” (thus, “sample_1” and “sample_2” in our example). The parameters metadata_keys and reduction_keys are used to specify the names of, respectively, the metadata columns and the cell embeddings that must be imported from the Seurat object.
WARNING: the sample_column parameter must be ALWAYS set, regardless of the metadata columns that must be imported, since it is used to interface the Seurat object with the anndata object like the Seurat_cellnames_prefix and the Seurat_cellnames_suffix parameters. If the user wishes to additionally import the sample labels contained in the metadata slot of the Seurat object as a cell annotation, the name of the respective column must be specified again in metadata_keys.
This tutorial covers the passages required for a standard FIERCE analysis. First, the package must be loaded, and the R workspace must be prepared for the analysis (a set of functions will be loaded for interfacing R with python):
library(FIERCE)
set_FIERCE_workspace()
We will be using the test dataset included in the package, that contains a random subset of 500 cells of the scRNA-seq dataset sequenced by Bastidas et al, Development 2019 to study the differentiation process of endocrine cells from their ductal progenitors in the mouse pancreas. It can be loaded in the R workspace through the following command:
adata <- load_test_dataset("pancreas")
Below we can see the expected genealogy of this differentiation process. In developing mouse embryos, the endocrine cells originate from ductal progenitors, passing through two intermediate states, i.e., endocrine progenitors and pre-endocrine cells. In particular, the activation of the Ngn3 gene determines the commitment of endocrine progenitors to the endocrine fate, and thus represents the pivotal point of this trajectory. The differentiation potency is very high in ductal cells, and progressively decreases as the cells gradually specialize into the endocrine fate and lose their ability to transform into different cell types.
Note that, although this process is very well characterized, the test dataset included in FIERCE is too small to exhaustively represent the complete dynamics of all the key genes involved in the development of these cells, thus “incomplete” results are to be expected. It has been assembled just for the sake of this tutorial, to show the usage of the functions of FIERCE.
Although FIERCE can perform all computations on “native” transcript counts, the default mode works on the first order moments of both spliced and unspliced counts. The first order moments (Bergen et al, Nat Biotech 2020) can be described as a “smoothed” version of transcript counts that takes into account the expression profiles of nearby cells, meaning that the expression level of a gene in a given cell is increased or decreased depending on whether the expression of the same gene is increasing or decreasing in neighboring cells on a neighborhood graph. The advantage of the first order moments is that they better capture the dynamics of genes in the population. The downside is that they require the preliminary computation of a neighborhood graph. For this purpose, FIERCE includes the perform_preprocessing function, that can be used to perform a standard bioinformatic analysis with Scanpy (Wolf et al, Genome Biology 2018) on the anndata object. By default, this function performs all the steps of a standard Scanpy analysis, including quality control, data normalization, data scaling, the principal component analysis (PCA), and the computation of a neighborhood graph on the resulting components. Such graph is also used to perform a clustering analysis and to compute a UMAP embedding. Here we will let the function perform all the default steps excluding the quality control, the clustering analysis and the computation of the differentially expressed genes, that are not necessary for the computation of the first order moments:
perform_preprocessing(adata, compute_QC_metrics=FALSE,
compute_MT_fraction=FALSE,
perform_clustering=FALSE,
find_DEGs=FALSE,
color_as="cell_types")
The perform_preprocessing function outputs a series of plots generated by Scanpy (saved in the “./FIERCE_results/scanpy_preprocessing” directory), including the projection of cells on the first principal components and on the UMAP embedding. Cells can be colored according to any annotation contained in the anndata object (in the “obs” slot) by specifying the respective name in the dedicated color_as parameter. Multiple annotations can be specified at once, and the order and color of the respective categories can be set through the lab_order and palette parameters. Note that, for any categorical annotation of the cells, the colors and order of the labels need to be set only once, then they will be stored in the anndata object and henceforth re-used by the FIERCE plotting functions, until the user specifies new colors or a new order. Here we specify the cell type annotation (“cell_types”), to visualize the succession of cell states of the pancreas differentiation process (see the UMAP below). The same annotation will be specified in all the downstream plotting functions.
As expected, the structure of the trajectory is already distinguishable in the UMAP, with the endocrine progenitors originating from the ductal cells, the pre-endocrine cells originating from the endocrine progenitors after the activation of the Ngn3 gene, and the pre-endocrine cells finally branching into the different endocrine cell types. This confirms that the neighborhood graph is appropriate and can be used for the downstream steps of the analysis.
Once a neighborhood graph has been computed, the anndata object is ready to be analyzed by FIERCE, starting with the first step of the pipeline, i.e., the inference of the future transcriptional states of cells. This is performed by the compute_future_states function, that works in three steps:
The computations of steps 1 and 2 are performed with the scVelo python package (Bergen et al, Nat Biotech 2020). All three steps are performed by this command:
compute_future_states(adata)
The vector field of RNA velocity can be optionally built on the previously computed UMAP embedding, and visualized on a streamplot (saved in the “./FIERCE_results/velocity_field_streamplots” directory) with the following command:
plot_velocity(adata, color_as="cell_types")
The RNA velocity vector field correctly depicts the expected trajectory, thus confirming the reliability of the computed future transcriptional states. The objective of FIERCE is now to “transpose” this vector field on the signaling entropy space, to depict the trajectories of cells directly on a representation of the potency landscape of this biological system.
The core step of the FIERCE pipeline involves the computation of the velocity of the entropy. The SCENT R package (Teschendorff et al, Nat Comm 2017) is used to compute the local signaling entropy scores of all genes in all cells, first using their observed spliced first order moments as expression signal, then using the previously computed future moments. Then, the observed local entropy scores are subtracted from the future scores, to obtain the matrix containing the velocity of the entropy of each gene in each cell. The whole computation is performed by the following command:
compute_signaling_entropy(adata, species="Mouse",
compute_potency_states=TRUE,
phenotype_annotation="cell_types",
n_cores=20)
By default, the signaling entropy scores are computed using the “high confidence” built-in PPI network of FIERCE, that contains all the protein-protein interactions of the STRING database with a minimum confidence score of 0.7. Such network is available for both human and mouse; human is the default, so if the dataset is murine it must be specified through the species parameter. Optionally, the compute_signaling_entropy function also identifies the potency states of the cell population based on the distribution of the total signaling entropy scores of cells across the categories of an annotation of choice (specified through the phenotype_annotation parameter; typically the cell types are used for this purpose). The potency states can be described as discrete distributions of cells with similar signaling entropy scores, each with a given mean and variance. They usually represent “stable states” within the dynamic process, thus they can be very useful for catching a first glance of the structure of the potency landscape of the population before a more detailed view is provided by the streamplot built by FIERCE in the final step. Note that the potency states are numbered according to their mean entropy score, but in decreasing order, meaning that the first states have the highest mean entropy, while the last have the lowest.
The signaling entropy score is very complex, thus this is the most computationally intensive step of the analysis. Several hours are typically necessary, thus we recommend using the n_cores parameter to allocate as many available cores as possible to this task. To avoid losing data in the case of an accidental interruption, FIERCE subdivides the dataset into smaller batches (5000 cells by default) and saves the results after the computations for each batch are concluded. These results are saved into a temporary python dictionary that is placed into the current working directory, and is automatically removed after all the computations have been done. Since the test dataset includes just 500 cells, the creation of the dictionary will be skipped, and the cells will be analyzed all together in just one batch.
Optionally, it is possible to visualize the results of the signaling entropy computation through the following command, that outputs a series of descriptive plots (saved in the “./FIERCE_results/signaling_entropy_plots” directory):
plot_entropy_results(adata, phenotype_annotation="cell_types")
In particular, to explore the potency structure of the population it is useful to visualize the distribution of the total signaling entropy scores and of the potency states of cells across the sequence of cell types that define the dynamic process.
Looking at these plots, it is immediately evident how the total signaling entropy scores perfectly reflect the differentiation potency of cells, with the ductal cells showing the highest scores, the endocrine cells showing the lowest scores, and the endocrine progenitors and the pre-endocrine cells showing intermediate scores. Based on such distribution of the total entropy scores, 3 potency states have been identified in this dataset, with potency state 1 including the cells with highest entropy, potency state 2 including the cells with intermediate entropy, and potency state 3 including the cells with lowest entropy.
An interesting property of signaling entropy, summarized by the plots shown below, is its tight relationship with gene expression. Precisely, the total signaling entropy scores tend to be directly proportional to the number of expressed genes per cell, but inversely proportional to the mean gene expression per cell. This property is explained by the very definition of signaling entropy, that is directly proportional to the dispersion of the gene expression signal of cells over the different pathways of the PPI network. Briefly, high entropy cells tend to express more genes, but with lower intensity, while low entropy cells tend to express fewer genes, but with higher intensity.
The final step of the FIERCE pipeline consists in the construction of the velocity of the entropy vector field, which is performed in two phases. First, a UMAP embedding is directly constructed from the observed local signaling entropy scores of cells with the Scanpy python package (Wolf et al, Genome Biology 2018):
compute_entropy_UMAP(adata, color_as=c("cell_types",
"total_entropies_observed","Potency_states_observed"),
palette=list(NULL,NULL,c("#b2e2e2","#66c2a4","#238b45")))
This time, in addition to the cell type labels, we color the cells according to their total signaling entropy scores and to their potency state labels (previously assigned by the compute_signaling_entropy function). The shape of the resulting embedding is very similar to its counterpart built from gene expression, but shows a thinner, more “elongated” form, along which cells distribute according to a decreasing signaling entropy gradient. As expected, such gradient is also accompanied by the passage from potency state 1 to potency state 3, passing through potency state 2. This pattern recapitulates the expected structure of the differentiation potency landscape of this dynamic process, with the ductal cells progressively losing their potency as they transform into the endocrine progenitors first, then into the pre-endocrine cells, and finally into the mature endocrine cells.
By default, the compute_entropy_UMAP function computes 50 principal components from the local entropy scores, and then uses all such components to construct a neighborhood graph with 30 neighbors, from which the UMAP embedding is finally built. However, the user may specify any number of principal components to use for the graph construction through the n_pcs parameter, as well as any number of neighbors for the graph itself through the n_neighbors parameter; the results of each combination of these two parameters are saved into separate sub-directories (that are placed into the “./FIERCE_results/UMAP_from_entropy” directory).
Once the UMAP embedding has been computed from the entropy scores, the second phase can be performed through the following command, that uses specific functions of the scVelo python package (Bergen et al, Nat Biotech 2020) to construct the velocity of the entropy vector field on the embedding (saved in the “./FIERCE_results/velocity_field_streamplots” directory):
compute_graph_and_stream(adata, color_as=c("cell_types",
"total_entropies_observed","Potency_states_observed"))
This “streamplot” represents the final output of the FIERCE pipeline. The key advantage it offers is the possibility to visualize cellular attributes of interest directly on a reconstruction of the differentiation potency landscape of the biological system, and to analyze their distribution in relation to the predicted future trajectories of cell subpopulations on such landscape. In this case, we can see that the main stream of arrows strictly follows the decreasing entropy gradient from the ductal cells to the endocrine cells, as well as the passage from potency state 1 to potency state 3. This dynamic process has been correctly reconstructed in an image that depicts both its structure and its dynamics.
By default, the compute_graph_and_stream function constructs the vector field on the last UMAP embedding that has been computed by the compute_entropy_UMAP function; however, the user may choose any embedding that has been previously computed by said function by specifying the respective combination of principal components / number of neighbors through, respectively, the n_pcs_emb and the n_neighbors_emb parameters.
Optionally, it is also possible to construct the velocity graph (i.e., the structure underneath the vector field) using a second independent neighborhood graph with a different number of neighbors; such graph can be defined through the n_pcs_graph and the n_neighbors_graph parameters. The velocity arrow of a cell points towards the neighbors whose signaling entropy is most correlated with its velocity of the entropy, thus increasing or decreasing the number of neighbors can have a significant impact. In general, increasing the number of neighbors results in longer arrows that coalesce into longer and more continuous streams, that can be useful to better visualize the long-term trajectories of larger datasets. In smaller datasets, instead, decreasing the number of neighbors can help to better visualize the short-term trajectories. Decoupling the neighborhood graph used for the velocity graph construction from the graph previously used for the UMAP construction allows the user to freely change the number of neighbors considered by the algorithm for determining the direction of the velocity arrows without having necessarily to change also the underlying embedding. In the example below, the vector field was constructed by setting n_pcs_graph=50 and n_neighbors_graph=100 (instead of 30 like in the streamplot above).
Of course, this test dataset is too small and simple to show any significant difference, however we can still see that in certain spots, like in the ductal cells and in the pre-endocrine cells, a few arrows that previously showed a slightly curvy pattern now follow a straighter trajectory.
To conclude, we remind that, at the end of the analysis, the anndata object can be saved on disk with the following command:
save_h5ad(adata, "adata.h5ad")
All the results of the analysis are stored in dedicated slots inside the anndata object, which can be re-loaded into the R environment at any time to create new plots.